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?_i . 

Multiple outcomes, both continuous and discrete, are routinely 

gathered on subjects in longitudinal studies and during routine clini- 
cal follow-up in general. To motivate our work, we consider a longitu- 

\& , dinal study on patients with primary biliary cirrhosis (PBC) with a 

continuous bilirubin level, a discrete platelet count and a dichotomous 
indication of blood vessel malformations as examples of such longi- 

f\ ' tudinal outcomes. An apparent requirement is to use all the outcome 

values to classify the subjects into groups (e.g., groups of subjects 
with a similar prognosis in a clinical setting). In recent years, nu- 
merous approaches have been suggested for classification based on 
longitudinal (or otherwise correlated) outcomes, targeting not only 
^i , traditional areas like biostatistics, but also rapidly evolving bioinfor- 

matics and many others. However, most available approaches con- 
sider only continuous outcomes as a basis for classification, or if non- 
^ . continuous outcomes are considered, then not in combination with 

00 ' other outcomes of a different nature. Here, we propose a statistical 

method for clustering (classification) of subjects into a prespecified 
number of groups with a priori unknown characteristics on the ba- 
sis of repeated measurements of several longitudinal outcomes of a 
different nature. This method relies on a multivariate extension of 

^^ ■ the classical generalized linear mixed model where a mixture dis- 

C^) ' tribution is additionally assumed for random effects. We base the 

inference on a Bayesian specification of the model and simulation- 
based Markov chain Monte Carlo methodology. To apply the method 
in practice, we have prepared ready-to-use software for use in R 

p\ . (http://www.R-project.org). We also discuss evaluation of uncer- 

tainty in the classification and also discuss usage of a recently pro- 
posed methodology for model comparison — the selection of a number 
of clusters in our case — based on the penalized posterior deviance 
proposed by Plummer [Biostatistics 9 (2008) 523-539]. 
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2 A. KOMAREK AND L. KOMARKOVA 

1. Introduction. 

1.1. Data and the research question. In clinical practice multiple markers 
of disease progression, both continuous and discrete, are routinely gathered 
during the follow-up to decide on future treatment actions. Our work is 
motivated by data from a Mayo Clinic trial on 312 patients with primary 
biliary cirrhosis (PBC) conducted between 1974-1984 [Dickson et al. (1989)]. 
This longitudinal study had a median follow-up time of 6.3 years with a 
large number of clinical, biochemical, serological and histological parameters 
recorded for each patient. The data are available in Fleming and Harrington 
(1991), Appendix D, and electronically at http://lib.stat.cmu.edu/datasets/ 
pbcseq. 

With these data, we shall mimic a common problem from the clinical 
practice: at a prespecified time point from the start of follow-up we want to 
use the values of the markers of the disease progression to identify groups of 
patients with similar characteristics. That is, we want to perform a cluster 
analysis using the longitudinal measurements. With these motivating data, 
we perform a classification of patients who survived without liver transplan- 
tation the first 910 days (2.5 years) of the study (N = 260), the data being 
further referred to as PBC910. This corresponds to the practical problem 
outlined above, that is, clustering of patients being available at a given time 
point. For the purpose of this paper, the time point of 910 days was se- 
lected arbitrarily. Its choice in other application can, of course, be driven 
by practical or other considerations. The following markers will be consid- 
ered for the cluster analysis: continuous logarithmic serum bilirubin (Ibili), 
discrete platelet count (platelet) and dichotomous indication of blood vessel 
malformations (spiders); see Figure 1. 

In a clinical routine usually only the last available measurements reflect- 
ing the current patient status are used to identify the prognostic groups — 
clusters. Clearly, such a procedure ignores the available information on the 
markers' evolution over time, which might be more important for reasonable 
classification than simply the last known state. To remedy this deficiency, 
we shall propose a clustering method exploiting jointly the whole history 
of longitudinal measurements of all considered markers which might have a 
different nature from being continuous to discrete, or even dichotomous. 

1.2. Basic notation and data characteristics. Let Yj )7 . = (Yj r i,..., 
^,r,n ir ) T denote a random vector of the longitudinal profile of the rth 
marker (r = 1, . . . ,R) pertaining to the zth subject (i = 1, . . . , N). Further, 
let Yj = (Y^, . . . ,Yj R ) T be a random vector of all longitudinal measure- 
ments on the rth subject and Y = (Yj , . . . , Y^) T a random vector repre- 
senting all available outcomes. As usual, let j/i >r j , yi^ r , yj , y denote the ob- 
served counterparts of corresponding upper case random variables and vec- 
tors. Throughout the paper, we will assume the independence of subjects, 
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Table 1 

Data PBC910. Characteristics of the time points at which the longitudinal values of the 

markers for clustering were taken. For each marker r and each visit j , n*j gives the 

number of available measurements, med, Q± and Q3 are the median, the lower and the 

upper quartile of the time points in months when the measurements were taken 

Ibili (r = 1) platelet (r = 2) spiders (r = 3) 



ti,r,j (months) ti,r,j (months) *i,r,j (months) 



med (Q1Q3) n*j med (Q1Q3) n*j med (Q1-Q3) 



1 


260 


0.0 


(0.0-0.0) 


256 


0.0 


(0.0-0.0) 


260 


0.0 


(0.0-0.0) 


2 


248 


6.1 


(5.9-6.7) 


241 


6.1 


(5.9-6.7) 


247 


6.1 


(5.9-6.8) 


3 


226 


12.2 


(11.8-12.9) 


224 


12.2 


(11.8-12.9) 


224 


12.2 


(11.8-12.9) 


4 


181 


24.3 


(23.8-25.3) 


180 


24.3 


(23.8-25.2) 


180 


24.3 


(23.8-25.2) 


5 


3 


23.4 


(23.4-26.5) 


2 


23.4 


(23.4-23.4) 


2 


23.4 


(23.4-23.4) 



that is, independence of Yi, . . ., Yjv- Furthermore, let fi = Ei=iEr=i n v 
be the total number of observations and let t% T ,j be the times (on a study 
time scale) at which the individual values Y%rA (i = lj • • • ■, N, r = 1, . . . , R, 
j = 1, . .. ,nj ir ) were taken. Finally, let p(-) and p(-\-) be generic symbols for 
(conditional) distributions. 

In the PBC910 data and our application, the number of markers R equals 3. 
As it is common with the longitudinal data, numbers n, jr of available mea- 
surements of each marker varies (between 1 and 5) across patients (me- 
dian 4), leading to n = 2734. Further, the distribution of the time points t% r ,j 
also varies across subjects and markers; see Table 1. However, note that the 
fifth visit which is available for only three patients is not outlying with re- 
spect to its timing from the rest of the data set. Indeed, it only corresponds 
to patients with a slightly more frequent visiting schedule. In summary, our 
longitudinal data are heavily unbalanced and irregularly spaced in time, also 
containing 12 patients for whom only baseline marker values at time t = 
are available. It is our aim to also classify or at least suggest a classification 
for those patients. 

1.3. Existing clustering methods, the need for extensions. In the litera- 
ture numerous clustering methods applicable in many different situations 
are available. Nevertheless, as we discuss below, none of them is applicable 
for our problem of clustering where each subject is represented by a set of R, 
in general unbalanced and irregularly sampled longitudinal profiles of mark- 
ers which may have a different nature starting from continuous and ending 
with a dichotomous one. 

1.3.1. Classical approaches and a mixture model-based clustering. Apart 
from classical approaches like hierarchical clustering or i'f-means method 
[see, e.g., Hastie, Tibshirani and Friedman (2009), Chapter 13, Johnson and 
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Wichern (2007), Chapter 12] and their many extensions, model-based clus- 
tering built on mixtures of parametric or even nonparametrically specified 
distributions assumed for random vectors Yi , . . . , Y^r has become quite pop- 
ular in the past decade [e.g., Fraley and Raftery (2002)]. This is probably 
partially due to the availability of ready-to-use software like R [R Devel- 
opment Core Team (2012)] packages mclust of Fraley and Raftery (2006) 
for clustering based on mixtures of multivariate normal distributions, earlier 
versions of the mixAK package described by Komarek (2009) or mixtools of 
Benaglia et al. (2009), which also allows for nonparametric estimation of 
the mixture components. Another rapid evolution of model-based clustering 
algorithms also originates from their need in gene-expression data analy- 
sis [e.g., Newton and Chung (2010), Witten (2011)]. Nevertheless, classical 
approaches, model-based clustering based on mixtures of distributions and 
many other related methods are not applicable in our context. 

Some of the above mentioned methods rely on distances based on a 
suitable metric between the observed values of underlying random vectors 
Yi, . . . , Y^r being viewed as points in the Euclidean space of a certain di- 
mension. However, in our situation the dimension of each Y, is generally 
different for each subject and typically random. Hence, it is even not possi- 
ble to define a common sample space needed to define a reasonable metric 
to calculate the distances. 

For model-based methods, on the other hand, it is necessary to assume 
that the random vectors Yi, . . . , Ytv are independent and, given the classi- 
fication, identically distributed according to a suitable (multivariate) distri- 
bution. Neither of these can be assumed since for typical longitudinal data 
(including our PBC910 data) the measurements are taken at different time 
points for each subject and, hence, Yi,...,Y^ are hardly identically dis- 
tributed even if the number of measurements was the same for all subjects 
(which is also not the case for our data). 

1.3.2. Clustering based on a mixture of regression models. For data where 
the ith subject out of N to be classified may be represented by one response 
random variable Yi and a vector of possibly fixed covariates Xj , several meth- 
ods for clustering based on mixtures of regression models have been devel- 
oped. Among the first, Quandt and Ramsey (1978) assume a two-component 
mixture of two normal linear regressions. Extension into a general number 
of components and also a practically applicable implementation is provided 
by Benaglia et al. (2009). A variant of the clustering based on a mixture 
of regressions with application to gene-expression data is given by Qin and 
Self (2006). A generalization, allowing also for nonnormally distributed re- 
sponse random variables Y\, . . . , lj\r, is due to Grim and Leisch (2007), who 
consider mixtures of generalized linear models. To apply these methods for 
our application, the single response random variables ^, r ,j could play the 
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role of the response variables in the mixtures of regression models and the 
time points U r j the role of the fixed covariates. Nevertheless, the clustering 
approaches based on mixtures of regression models are also ruled out in our 
situation since (a) we cannot assume a single parametric distribution for all 
response variables in the data since both continuous and discrete response 
variables appear in our data set, (b) each subject is in general represented 
by more than one pair (response, covariates). 

1.3.3. Clustering approaches for functional data and stochastic processes. 
For given r € {1, . . . ,R}, a set {Yi )T , . . . , Y^^} of longitudinal trajectories 
of the rth marker could also be viewed as a set of functional observations or, 
in more general, a set of realizations of a certain random process. A similar 
setting is also found in the applications in genomics where Y$ r is typically a 
vector representing the expression curve of gene i over time. In the functional 
data or genomics literature, several clustering methods have been developed 
for situations when it is possible to assume a decomposition of each observed 
value into 

where rrii jr (t) is either the value of the underlying random functional or the 
mean ith gene-expression at time t or, in general, the underlying stochastic 
process, and e» r ,• (i = 1, . . . ,N, j = 1, . . . , rtj r ) are random variables with a 
zero mean and either a common variance a 2 or subject/gene specific vari- 
ances of (i = 1, . . . ,N). Based on this model (1), James and Sugar (2003) 
and Liu and Yang (2009) developed methods for the clustering of functional 
data. Peng and Miiller (2008) proposed a distance-based clustering method 
and apply it to data from online auctions. For the genomics applications, 
Ramoni, Sebastiani and Kohane (2002) present an agglomerative clustering 
procedure based on the autoregressive model in equation (1). Another gene- 
expression clustering application based in fact on a mixture of regression 
models in equation (1) is provided by Ma et al. (2006). In our situation, 
these methods could only be applied if there is only one continuous marker 
available for each patient. Hence, with the PBC910 data, clustering would 
have to be based only on either Ibili values or platelet values if it was as- 
sumed that they come from a continuous location-shift distribution. The 
dichotomous spiders values cannot be used at all. 

1.3.4. Clustering based on mixture extensions of the mixed models. For 
the analysis of the continuous longitudinal data, the linear mixed model 
[LMM, Laird and Ware (1982)] plays a prominent role. For given r G {1, . . . , R}, 
it is based on expression 

yZ) I i >r = X^ r OC r + Z; r Dj r -j- £j «., 2 = l,...,iv, 
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where Xj >r and Zj iT . are vectors of fixed covariates containing the time points 
ti,r,j (j = l,---,Tiir) and possibly other factors. Further, a r is a vector of 
unknown regression parameters, bj !r are i.i.d. random variables — random 
effects with unknown mean f3 r and a covariance matrix D r , and £j >r are 
independent random vectors with zero mean and a covariance matrix Sj ]r . 
To cluster subjects based on the continuous longitudinal data, several ap- 
proaches stemming from a mixture extension of the LMM (2) have been 
proposed in the literature. Verbeke and Lesaffre (1996) assume a normal 
mixture in the distribution of random effects and apply their method to clus- 
tering of growth curves, while Celeux, Martin and Lavergne (2005) consider 
a mixture of linear mixed models and perform clustering of gene-expression 
data. De la Cruz-Mesia, Quintana and Marshall (2008) proceed in a similar 
way, however, they replace the *J r a r + nJ r hi iT part of (2) by a nonlinear 
expression in a r and bj j7 .. 

By a suitable choice of the covariate vectors and imposing a suitable 
structure on the error covariance matrices Ej, it is possible to use the LMM 
(2) also for the analysis of R > 1 continuous longitudinal markers and for 
clustering based on it as was done by Villarroel, Marshall and Baron (2009), 
or could be done using the model of Komarek et al. (2010) who performed 
the discriminant analysis, though. However, analogously to Section 1.3.3, all 
mentioned methods could be used for our application only if we wanted to 
base the clustering only on Ibili and/or platelet values. 

One possible strategy for clustering based on not only continuous longi- 
tudinal profiles is to use a general form of the model proposed by Booth, 
Casella and Hobert (2008) [equation (3) in their paper], where they assume 
that the (not necessarily normal) distribution of longitudinal observations of 
a particular marker depends on cluster-specific parameters and on a vector 
of random effects. Nevertheless, except for this general definition, they focus 
in their paper on a linear mixed model which is only applicable in situations 
when the observed longitudinal markers are continuous. 

A specific option which allows for clustering based on a single longitudinal 
marker of a discrete nature is to replace the LMM (2) by a generalized linear 
mixed model [GLMM, e.g., Molenberghs and Verbeke (2005)] and assume 
a suitable mixture in the distribution of random effects [Spiessens, Verbeke 
and Komarek (2002)]. An example of clustering based on such a model is 
shown in Molenberghs and Verbeke (2005), Section 23.2. Nevertheless, it is 
still not possible to jointly use all three (in general, all R) markers. 

1.3.5. Objectives and outline of the paper. In previous paragraphs we 
gave a brief overview of the most common classes of clustering approaches. 
We also argued that none of them are capable of exploiting jointly irregular 
longitudinal measurements of R > 1 markers of different nature (continuous, 
discrete or even dichotomous) as it is required by the PBC910 data. Even 
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though our overview is by no means exhaustive, we are not aware of any 
method that would meet such needs. For these reasons we propose a clus- 
tering method that will be built upon the multivariate extension (where the 
word "multivariate" points to the fact that R > 1 markers will be modeled 
jointly) of the GLMM with a normal mixture in the distribution of random 
effects proposed by Spiessens, Verbeke and Komarek (2002), [SVK], and 
Molenberghs and Verbeke (2005), [MV]. Not only did they model just the 
R = 1 longitudinal outcome, but they also considered only a homoscedastic 
normal mixture. Nevertheless, this is a rather restrictive assumption, espe- 
cially in our context where each mixture component should represent one 
cluster. Hence, to have a better ground for clustering, the heteroscedastic 
mixture will be considered in our proposal. Further, in their illustrations, 
[SVK] and [MV] usually included at most bivariate random effects. This was 
probably due to the fact that as a method of estimation they exploited the 
maximum-likelihood through the EM algorithm which starts to be computa- 
tionally troublesome for models with random effects of a higher dimension. 
For computational complexity implied by the multivariate extension of the 
mixture GLMM, but not only because of this, we shall use the Bayesian in- 
ference based on the Markov chain Monte Carlo (MCMC) simulation here. 
In Section 2 we next describe the multivariate mixture generalized linear 
mixed model which will serve as the basis for our clustering procedure and 
show how to apply it to the PBC910 data. The clustering procedure will 
be described, and the clustering of patients from the PBC910 data will be 
performed in Section 3. In Section 4 we discuss the possibility of estimating 
a number of clusters needed in situations when this does not follow from the 
context. We evaluate the proposed methodology in Section 5 on a simulation 
study and finalize the paper by a discussion in Section 6. 

2. Mixture multivariate generalized linear mixed model. 

2.1. Model specification. Our proposed clustering procedure is based on 
a multivariate mixture generalized linear mixed model (MMGLMM). We 
first express the conditional mean of each response profile using a standard 
GLMM, that is, 

(3) 

i = 1, . . . , N,r = 1, . . . , R, j = 1, . . . , m tr , 

where /i" 1 is the link function used to model the mean of the rth marker, 
Xjrj, z i,r,j are vectors of known covariates which may include a constant 
for intercept, time values in which the longitudinal observations have been 
taken or any other additional covariates. Further, a r is a vector of unknown 
regression coefficients (fixed effects) and bj >r is a vector of random effects 
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for the rth response specific for the ith subject. We assume hierarchically 
centered GLMM [Gelfand, Sahu and Carlin (1995)] where the random effects 
bi )T ., . . . ,bjv,r have in general a nonzero and unknown mean, let's say (3 r , 
r = 1, . . . , R; see equation (5) below. Being within the GLMM framework, we 
assume that for each i = 1, . . . , N, r = 1, . . . , R, j = 1, . . . , n^, the conditional 
distribution p(yi ! r,j\4 > rt a r^i,r) belongs to an exponential family with the 
mean specified by (3), and possibly unknown dispersion parameter <j) r . In a 



is 



sequel, let if) = ((f) , a ) , where a = (a{ , . . . , ajj) , <fi = (<p{ , . . . , 4> R ) 
the vector of GLMM related parameters. 

Further, let bj = (b ^ , . . . , hJ R ) T be a joint vector of random effects for the 
ith subject (i = 1, . . . , N). Dependence between the R longitudinal markers 
of a particular subject i represented by the response vectors Y^i, . . . , Y^r 
is taken into account by assuming a joint distribution for the random effect 
vector bj which also grounds our clustering procedure. We assume that the 
ith subject belongs to one of a fixed number of K clusters (see Section 4 for 
possible approaches to choose K if this does not follow from the context of 
the application at hand), each cluster with a probability Wk = P(ui = k\w), 
(0 < Wk < 1, k = 1, . . . ,K, Ylk=i w k = 1)) where m € {1, . . . ,K} is the ith 
subject allocation and w = (w\, . . . ,wk) T ■ We further assume that the cor- 
responding random effect vector bj follows a multivariate normal distribu- 
tion with an unknown mean fj, u . and a (generally nondiagonal) unknown 
covariance matrix D Ui , that is, 

p(bi\e,Ui = k) = v3(bj;/x fc ,D fc ), i = 1, . . . ,N,k = l, . . . ,K, 

where y?(-|/x, D) is a density of the (multivariate) normal distribution with a 
mean /^ and a covariance matrix D, and 9 = (w , fij , . . . , fij^, vec(Di), . . . , 
vec(D^)) T is a vector of unknown parameters related to the distribution of 
random effects. That is, overall, we assume a multivariate normal mixture 
in the distribution of random effects: 

K 
(4) b i \0 iA ^Y / ^kMVAf(n k ,B k ), i = l,...,N. 

k=i 

With this approach, we represent the unbalanced longitudinal observations 
Yi, . . . , Yjv using a set of i.i.d. random vectors bi, . . . , b^v which allows us 
to develop a clustering procedure based on ideas of the mixture model-based 
clustering introduced in Section 1.3.1. 

Finally, we point out that given our model, the mean effect (in a total pop- 
ulation) of covariates included in the vectors Zj >r .j, i = 1, . . . , N, r = 1, . . . , R, 
j = l,...,n it r is given by 

K 



(5) (3 = ^2^^- 



fc=i 
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This is a vector composed of R subvectors, say, /3 l7 . . . ,(3 R , which play the 
role of the fixed effects for covariates included in the z-covariate vectors. 
Hence, for identifiability reasons, it is assumed that the vectors Xi r j and 
z i,r,j, i = 1, ■ ■ ■ ,N, r = 1, . . . ,R, j = 1, . . . , n^r do not contain the same co- 
variates. 

2.2. Likelihood, Bayesian estimation. For largely computational reasons, 
we shall use the Bayesian inference based on the output from the MCMC 
simulation. To this end, the model must be specified also from a Bayesian 
point of view. A priori, we assume the independence between the mixture 
related parameters and the GLMM related parameters ip. That is, the 
prior distribution p(i/», 6) factorizes as p(i/>, 0) = p(ip) x p(6). The factoriza- 
tion of the prior is typical in generalized linear mixed models and might be 
justified in our application because the parameters involved in the ij) and 6 
vector, respectively, express different features of the model. Specifically, for 
p(0), we use a multivariate version of the classical proposal of Richardson 
and Green (1997) as prior distribution, and, for p(ip), we adopt classically 
used priors in this context [see, e.g., Fong, Rue and Wakefield (2010)]. A de- 
tailed description of the assumed form of p(ip,6) is provided in Appendix 
A of the Supplementary Material [Komarek and Komarkova (2013)] where 
we also explain how to choose the prior hyperparameters in order to obtain 
a weakly informative prior distribution. 

The likelihood of the MMGLMM follows from (3) and (4): 

N / K \ 

(6) ^iM)=Ky|^.0) = n(I>* L *.*(^'*))' 

»=i \fc=i / 

where 

» C R n i,r ~\ 

L it k(i/>,Q)= I] ~[p(yi,r,j\(t)r,a r ,hi )r ) \p(hi\6,Ui = k)dhi, 

■' [r=lj=l ) 

(7) 

i = l,...,N,k = l,...,K 



L, . . . , i 1 , ,V J-, . . . , . 



is the contribution of the ith subject to the likelihood under the assump- 
tion that the random effects are distributed according to the kth. mixture 
component. 

MCMC methods are used to generate a sample Sm = {(i/)( m \o( m >) : 
m = 1, . . . , M} from the posterior distribution p(ip, 0\y) oc L(ip, 6) x p(ip, 6). 
Namely, a block Gibbs algorithm is used with the Metropolis-Hastings steps 
for those blocks of model parameters where the normalizing constant of the 
full conditional distribution does not have a closed form. A well-known iden- 
tifiability problem which arises from the invariance of the likelihood under 
permutation of the component labels is solved by applying the relabeling 
algorithm of Stephens (2000), which is suitable for mixture models targeted 
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toward clustering in particular. For details of the MCMC algorithm, refer 
to Appendix B of the Supplement [Komarek and Komarkova (2013)]. 

2.3. MMGLMM for the PBC910 data. The MMGLMM for the cluster- 
ing of patients included in the PBC910 data will be based on longitudinal 
measurements of (1) logarithmic serum bilirubin (Ibili, Yhj), (2) platelet 
counts (platelet, li,2j) and (3) dichotomous presence of blood vessel mal- 
formations (spiders, Yi,3,j) with assumed (1) Gaussian, (2) Poisson and (3) 
Bernoulli distribution, respectively. Exploration of the observed longitudinal 
profiles (see also Figure 1) suggests the following form of the mean struc- 
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Fig. 1. PBC910 data. Observed values of the longitudinal markers. Thick lines show 
cluster- specific marginal mean evolution over time based on posterior means of the mixture 
means /j, 1 (green) and /i. 2 (red). Observed values of dichotomous blood vessel malformations 
(spiders) are vertically jittered. 
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ture (3): 

( 8 ) log{E(Fj :2)i m 2 )} = 6j,2,l + h,2,2ti,2,j, 

logit{E(Y i:3 j\b i:3 ,a 3 )} = b it3 + a 3 t it3J , 

i = 1, .. . ,N, j = 1,. . . , nj, r , r = 1,2,3, where 1 < nj, r < 5. In model (8), £j, r j 
is the time in months from the start of the follow-up when the value of Yi, r j 
was obtained. 

In the main analysis, we will classify patients into two groups and, hence, 
a two component mixture {K = 2) will be considered in the distribution 
of five-dimensional random effect vector b, = (&i,i,i,&j,i,2, h,2, 1,^,2,2, ^i,3) T , 
where 6* 11, bi,2,i, h,3 are random intercepts from the GLMM for each marker 
and 64,1,2,^,2,2 are random slopes from the GLMM for the first two mark- 
ers. The model also involves the fixed effect ex = a 3 , the slope from the 
logit model for the third Bernoulli response and a dispersion parameter 
(pi = var(Yi,ij|bj,i), the residual variance from the Gaussian model for the 
first marker, the logarithmic bilirubin. Let o\ = \[4>\ be the correspond- 
ing residual standard deviation. The GLMM related parameters are thus 
i\) = (01,03) . The results that we report are based on 10,000 iterations of 
1:100 thinned MCMC obtained after a burn-in period of 1000 iterations. See 
Appendix C of the Supplement [Komarek and Komarkova (2013)] for a full 
Bayesian specification of the model, particular choices of the hyperparame- 
ters, an illustration of the performance of the MCMC and detailed results. 

With respect to clustering, the most important parameters are the mix- 
ture weights wi, W2 and the mixture means /i l5 /x 2 which characterize the 
clusters. Their estimates taken to be the posterior means (denoted by w\, W2, 
/2 1 ,/i2; resp.) estimated from an appropriately relabeled MCMC sample are 
given in Table 2 together with the 95% highest posterior density credible 
intervals (HPD CI). The first cluster is thus characterized by a remarkably 
lower baseline bilirubin level and its slower increase over time compared 
to the second cluster. For the platelet counts, there is almost no difference 
between the clusters at the baseline and only a moderate difference with 
respect to the rate of its change, with the second cluster showing a faster 
decline. Finally, blood vessel malformations exhibit a higher probability in 
the second cluster compared to the first one. From the clinical point of view, 
the first cluster exhibits more favorable values and also an evolution of all 
three markers and, hence, it should correspond to patients with a better 
prognosis compared to the second cluster. We confirm this conclusion in 
Section 3.1 upon the classification of the individual patients. 

To get a better idea of the meaning of the clusters, we used /i 1 and Ji 2 to- 
gether with the posterior means of the GLMM related parameters xj) (see Ta- 
ble 2) and the posterior means of the mixture covariance matrices Di and D2 
(see Table 3), and calculated the estimates of the cluster specific (marginal) 
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Table 2 

PBC910 data. Posterior means and 95% HPD credible intervals for mixture weights, 

mixture means and GLMM related parameters 





fe = l 


fe = 2 




wi = E(ioi|y) = 0.598 w? 


= E(tu 2 |y) =0.402 


Parameter 


(0.471, 0.711) 


(0.289, 0.529) 




Logarithmic bilirubin (/fa/7/) 




Intercept 


-0.209 


1.102 


Mfc.i = E (M*,i|y) 


(-0.332, -0.082) 


(0.828, 1.387) 


Slope 


0.00450 


0.01281 


ju fej2 = E(/ifc, 2 |y) 


(0.00056, 0.00818) 


(0.00476, 0.02108) 


Residual std. dev. 


0.314 




ai = E(ai|y) 


(0.294, 0.333) 
Platelet count (platelet) 




Intercept 


5.58 


5.46 


ju fe ,3 = E(/i fc>3 |y) 


(5.49, 5.65) 


(5.35, 5.58) 


Slope 


-0.00567 


-0.00828 


/2 fc ,4 = E(/i k>4 y) 


(-0.00799, -0.00339) (- 


-0.01354, -0.00306) 




Presence of blood vessel malformations (spiders) 


Intercept 


-4.33 


-0.83 


pfe.s = E(/i fei5 |y) 


(-5.90, -2.88) 


(-1.66, -0.02) 


Slope 


0.0280 




S 3 = E(a 3 y) 


(0.0026, 0.0532) 





Table 3 

PBC910 data. Standard deviations (on a diagonal) and correlations (off-diagonal 

elements) for each mixture component derived from the posterior means Di = E(Di|y) 

and D2 = E(D2|y) of the mixture covariance matrices 





Intercept 


Slope 


Intercept 


Slope 


Intercept 




(lbili) 


(lbili) 


(platelet) 


(platelet) 


(spiders) 


Intercept (lbili) 


0.428 


0.031 


fe = l 

-0.282 


-0.086 


0.326 


Slope (lbili) 




0.00837 


0.040 


-0.214 


0.100 


Intercept (platelet) 






0.309 


-0.039 


-0.042 


Slope (platelet) 








0.0105 


0.028 


Intercept (spiders) 






fc = 2 
0.119 




4.02 


Intercept (lbili) 


0.776 


-0.183 


-0.139 


0.171 


Slope (lbili) 




0.03090 


-0.034 


0.249 


0.116 


Intercept (platelet) 






0.398 


-0.046 


-0.191 


Slope (platelet) 








0.0232 


-0.043 


Intercept (spiders) 










2.42 
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mean longitudinal evolutions E(Y. tr) .\a r ,u = k) = E\ 3 {E(Y. !ri .\h,a r ,u = k)}, 
k = 1,2, r = 1,2,3 over time. These are plotted as green (k = 1) and red 
{k = 2) lines on Figure 1. 

3. Clustering procedure. It follows from the decision theory for classi- 
fication [see Hastie, Tibshirani and Friedman (2009), Section 2.4] that the 
optimal classification of the ith subject (i = 1, . . . ,N) is to be based on the 
posterior component probabilities 7Tj & = P(ui = k\y) (k = 1, . . . , K). In our 
case, they are calculated by marginalization over the posterior distribution 
as 



n,k = / Pi,kW>, OM^' e \y) <*(?/>, e ) = E fe,fc(^> 0)|y} 
(9) J 

w M 

m=l 

where by Bayes' rule 

Pi,k{ip, 0) = P{Ui = k\ip, 0, y 4 ) = — R ' — — - , 

/ 10 x 22i=iWiL it i(il>,9) 

i = l,...,N,k = l,...,K. 

The ith subject is classically assigned to the cluster gi for which 7rj j5i is 
largest among (9) [e.g., Titterington, Smith and Makov (1985), McLachlan 
and Basford (1988)]. 

In non-Bayesian applications, the clustering procedure is usually based 
on the values oip^k =Pi,k(' l />i #)> where xj) and 6 are suitable estimates (e.g., 
maximum-likelihood) of the model parameters. Only rarely the uncertainty 
in the estimation of pn~ expressed by evaluating their standard errors or 
calculating the confidence intervals is taken into account. This is probably 
because of the fact that pi t k(ip,@) depend on the model parameters in a 
relatively complex way. In our case of the MMGLMM, it is, for example, 
complicated by the necessity to integrate the GLM likelihood over the as- 
sumed distribution of the random effects [see equation (7)], which in general 
does not have an analytic solution. 

With the Bayesian approach followed in this paper, the values of 7?^ used 
for clustering are the estimated posterior means of Piki^t 0) and with the 
MCMC-based posterior inference, we can easily evaluate also the posterior 
standard deviations (counterparts of the classical standard errors) or the 
credible intervals (counterparts of the classical confidence intervals). These 
can be used to incorporate an uncertainty in the classification which, for 
example, in the clinical setting of the PBC910 data, can serve for the iden- 
tification of subjects that should undergo additional screening before their 
ultimate classification; see Section 3.1. 
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3.1. Clustering of patients from the PBC910 data. Classification of pa- 
tients according to the maximal value of the estimated posterior mean 7?j ^ 
leads to 167 patients being classified in group 1 and 93 patients in group 2; 
see Figure 2. We argued in Section 2.3 that from a clinical point of view, 
the first group should correspond to patients with a better prognosis com- 
pared to the second group. With the PBC910 data, it is possible to confirm 
this conclusion since the information concerning the residual progression free 
survival time, defined as time till death due to liver complications or till liver 
transplantation, is available in the form of the classical right-censored data. 
We calculated Kaplan-Meier estimates of the survival probabilities based on 
data from patients classified in each group. These are plotted as solid lines 
on Figure 3. Indeed, the survival prognosis of group 1 is much better than 
that of group 2 with the estimated 5-year survival probability in group 1 of 
0.934 compared to 0.554 in group 2, and the 10-year survival probabilities 
0.679 and 0.141 in groups 1 and 2, respectively. 

The fact that the posterior means 7Tj k °f the patient specific compo- 
nent probabilities characterize with different certainty the probabilities of 
the allocation for different patients is illustrated by Figure 4. It shows the 
MCMC-based estimates of the posterior distributions of the first component 
probabilities p^i = pi t i(tj),G) for three selected patients who were all classi- 
fied in a better prognosis group 1. For patient A, 7^1 = 0.990 with a very 
narrow 95% HPD CI of (0.970, 1.000) and, thus, her classification in group 1 
is almost certain. This is further confirmed by her progression- free survival 
time, which is almost 14 years. The posterior probability of belonging to 
group 1 is lower for patient C (7^1 = 0.667). Nevertheless, it is still twice 
as high as the posterior probability of belonging to group 2 (7^2 = 0.333) 
and, hence, it seems that patient C also belongs most likely in group 1. 
On the other hand, his progression-free survival time is only 3 years and 4 
months (i.e., only 10 months beyond the time point at which we perform 
classification) and, hence, from a clinical point of view, this patient should 
rather be classified in group 2. In this case, the uncertainty in classification 
is expressed by a very wide 95% HPD CI which is (0.168,1.000), covering 
the majority of the interval (0, 1). 

In clinical practice, the diagnostic procedure usually proceeds in several 
steps, where in each step it is decided whether it is possible to classify 
a patient with enough certainty or whether additional examinations are 
needed before the ultimate classification is determined. Quite naturally, one 
of the diagnostic steps of such a procedure could be based on calculated 
credible intervals for individual component probabilities pi t k = Pi t k(ip,Q)- 
The patient would be ultimately classified in one of the considered groups 
only if the lower limit of the corresponding credible interval exceeds a certain 
threshold, say, 0.5 in the simplest case of classification into K = 2 groups. 
Applying this procedure to the PBC910 data lead to 126 patients being 
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Fig. 2. PBC910 data. Observed values of the longitudinal markers upon classification 
into K = 2 groups. The thick lines show cluster-specific marginal mean evolution over 
time based on posterior means of the mixture means fj, 1 (green) and fi 2 (red). Observed 
values of dichotomous blood vessel malformations (spiders) are vertically jittered. Profiles 
of patients for whom the lower limit of the 95% HPD credible interval did not exceed 0. 5 
are drawn in light blue. 
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Fig. 3. PBC910 data. Kaplan-Meier estimates of survival probability beyond 910 days 
in each group (k = 1: green, k = 2: red) created using the clustering procedure. Solid lines: 
everybody classified using the maximal value 0/7?^, dotted-dashed line: only patients for 
whom the lower limit of the 95% HPD CI for the component probability exceeded 0. 5 were 
classified. 

classified in group 1 and 70 in group 2. In total, 64 patients (41 and 23 
originally classified in group 1 and group 2, respectively, see light blue profiles 
on Figure 2) remain without ultimate classification, and additional screening 
or examinations would have been recommended for them before making 
a final decision. The fact that the two groups consisting of only 126 + 
70 ultimately classified patients better reflect the progression free survival 
status is illustrated by the Kaplan-Meier survival curves (dotted-dashed 
lines on Figure 3), which are now faster, diverting with the estimated 5- 
year survival probability in group 1 of 0.960 compared to 0.465 in group 2, 
and the 10-year survival probabilities 0.748 and 0.108 in groups 1 and 2, 
respectively. 



4. Selection of a number of clusters. Until now, we assumed that the 
number of clusters, K, was known in advance and fixed. In medical applica- 
tions where the found clusters are expected to correspond to certain prog- 
nostic groups, this is quite a reasonable assumption. Nevertheless, in many 
other situations, the number of clusters should rather be inferred from the 
data themselves. 

With our approach, the selection of a number of clusters corresponds to 
the selection of a number of mixture components in the underlying distri- 
bution of the random effects of the MMGLMM. This can also be viewed 
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A: 0.990 (0.970, 1 .000) B: 0.837 (0.656, 0.967) C: 0.667 (0.1 68, 1 .000) 
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Fig. 4. PBC910 data. Histograms of sampled values of component probabilities pi.ify, 9) 
for three selected patients. Above the plot: estimated posterior mean 7?j,i and the 95% HPD 
CI for p iA =Pi,i(i>,0). 



as a problem of model selection or model comparison. Nevertheless, as de- 
scribed, for example, in McLachlan and Peel (2000), Chapter 6, the model 
comparison in the mixture setting is complicated by the fact that the classi- 
cal regularity conditions do not hold. For this reason, use of various sorts of 
information criteria is predominantly preferred to classical testing in most 
frequentist applications of the mixture models. In particular, the Bayesian 
information criterion [BIC, Schwarz (1978)] proved to be a useful tool for 
selecting the number of mixture components [Dasgupta and Raftery (1998), 
Fraley and Raftery (2002), Hennig (2004), De la Cruz-Mesia, Quintana and 
Marshall (2008) and many others]. 

In Bayesian statistics, the Bayes factors [Kass and Raftery (1995)], for 
which the BIC is an approximation, are widely recognized as a tool of 
model selection. However, as pointed out by Plummer (2008), the Bayes 
factors have some practical limitations. First, they cannot be routinely calcu- 
lated from the MCMC output. Second, they are numerically unstable when 
proper, but weakly informative diffused priors (as it is in our case) are used. 
In Bayesian applications, the deviance information criterion [DIC, Spiegel- 
halter et al. (2002)] seems to be the most widely used concept of model 
selection in the past decade. Nevertheless, DIC in the mixture context lacks 
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the theoretical foundations and its use remains controversial [see comments 
and rejoinder on Celeux et al. (2006)]. For these reasons, Plummer (2008) 
suggested basing criteria for model choice on penalized loss functions and 
cross-validating arguments. For mixture models, in particular, he suggested 
using the penalized expected deviance (PED). Since then, it has been suc- 
cessfully exploited in several applications [Komarek (2009), Cabral, Lachos 
and Madruga (2012), De la Fe Rodriguez et al. (2011)], and we shall use it 
as well to choose the number of clusters. 

The penalized expected deviance is defined as 

(11) PEB = E{D(i,,e)\y}+p opt , 

where D(ip, 6) = — 21og{L(i/>, 6)} is the observed data deviance of the model, 
and its posterior mean E{D(tp, 0)\y} (expected deviance) is easily estimated 
from the MCMC sample. Further, the p op t part of equation (11) is the 
penalty term called optimism, which can be estimated by using the two 
parallel MCMC chains and importance sampling [Plummer (2008)]. 

4.1. Number of clusters in the PBC910 data. Table 4 shows calculated 
values of the penalized expected deviance for the PBC910 data and models 
with K = 1,2,3,4 clusters. It shows that the two-component model fits the 
data clearly better than a model with just a single cluster. On the other 
hand, the three clusters are already too much for these data. Even though 
the expected deviance of the three-component model is lower than that of 
the two-component model, the decrease of the expected deviance is overcome 
by the penalization for the additional component. 

The conclusion that the third cluster is redundant for our application is 
also supported by the fact that when a three-component model is fitted to 
the PBC910 data, the two components almost coincide with the mixture 
components from the K = 2 model and the estimated weight of the addi- 
tional component is only w^ = 0.021, and only three patients are allocated 
here using the rule based on a maximal value of the posterior component 
probability. 

Table 4 

PBC910 data. Penalized expected deviance for 

models with K = 1,2,3,4 clusters 



K PED E{D(^,6)\y} p opt 



1 


14,277.9 


14,241.8 


36.1 


2 


14,164.1 


14,088.3 


75.8 


3 


14,183.1 


14,057.1 


126.0 


4 


22,405.1 


17,244.4 


5160.8 
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5. A simulation study. 

5.1. Simulation setup. The setup of the simulation study was motivated 
by the PBC910 application and the data were generated according to the 
model (8). For each subject i and each marker r, rtj )r = 4 visit times were 
generated, with the first visit time being equal to and the remaining 
three visit times being generated from uniform distributions on intervals 
(170,200), (350,390), (710,770) days, respectively. The covariate U,rj in (8) 
was the visit time in months. The GLMM related parameters were equal: 
o~i = \/</>i = 0.3, «3 = 0.05, and we tried two values for the true number of 
clusters: K = 2 and K = 3. In the two-cluster data, both mixture weights 
were rather high: w = (0.6,0.4) T , whereas in the three-cluster data, a small 
third component with W3 = 0.06 was created by splitting the second compo- 
nent of the two-cluster setting, leading to the weights w = (0.60, 0.34, 0.06) T . 
To make the differences between the clusters less obvious, not all elements 
of the mixture means varied across the clusters; see Table 5. Namely, in 
data with K = 2, both clusters shared the same value of the mean slope 
of the Gaussian response and also the same value of the mean intercept of 
the Poisson response. There were also some differences introduced across the 
mixture covariance matrices; see Table D.l in the Supplement [Komarek and 
Komarkova (2013)]. Example data sets generated according to considered 
simulation settings are also shown on Figures D.l and D.2 of the Supplement. 

To examine the performance of our method in situations when there 
is misspecification in the random effects distribution, we simulated data 
not only under the normal distribution of random effects, but also under 
the shifted-scaled multivariate ^-distribution with five degrees of freedom 
(MVT5). For each setting (K, distribution of random effects), we tried three 
values of sample sizes with total numbers of subjects being 50, 100,200. For 
each setting and each sample size, 100 data sets were generated. The pos- 
terior inference for each data set was based on 10,000 iterations of 1:100 
thinned MCMC obtained after a burn-in period of 1000 iterations. 

5.2. Parameter estimates. The left block of Table 5 shows the square 
roots of the mean squared errors (MSE) in the estimation of the most im- 
portant model parameters (mixture weights and means, GLMM fixed effects) 
provided that the number of clusters, K, is correctly specified. As parameter 
estimates, we considered the posterior means, and for a particular parame- 
ter the reported MSE is the average MSE over the parameter values from 
all K components. Detailed results also providing the bias and the stan- 
dard deviations of the posterior means are given in Tables D.3 — D.6 of the 
Supplement [Komarek and Komarkova (2013)]. 

With normally distributed random effects, the posterior means of model 
parameters seem to provide consistent estimates of the model parameters. 
The same can also be concluded when the true distribution of random effects 
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Simulation study: (a) proportions of models selected with K = 1, 2, 3, 4 using the PED criterion; (b) total classification error rate from a 

model with correctly specified K; (c) true values of mixture weights and mixture means; (d) square roots of the mean squared errors 

(MSE), where the calculated MSE is based on posterior means as parameter estimates. For each parameter, the reported MSE is the 

average MSE over the K mixture components. The N gives the true number of subjects in each cluster 
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is MVT5. Nevertheless, not surprisingly, the convergence of the parameter 
estimates to their true values is in general slower in this case. 

5.3. Classification error rates. With respect to classification, one of the 
most important measures is the classification error rate reported in the ninth 
column of Table 5. More detailed results, also showing conditional (given the 
cluster) classification error rates, are shown in Table D.2 of the Supplement 
[Komarek and Komarkova (2013)]. Among other things, we point out that 
even with incorrectly specified distribution of random effects, the classifica- 
tion error rates do not differ considerably from the error rates obtained with 
data for which the random effects distribution was correctly specified, and 
even with a moderate sample size of 200 subjects, the achieved error rate is 
as low as 10%. 

5.4. Selection of a number of clusters. Performance of the penalized ex- 
pected deviance as a tool for the selection of a number of clusters is il- 
lustrated by the right block of Table 5. For each simulated data set, we 
estimated four models, each of them under the assumption of a different 
number of clusters, namely, K = 1,2,3,4, and calculated the corresponding 
PED values. Table 5 shows proportions of data sets for which a particular 
number of clusters was selected by minimizing the PED. For each simula- 
tion setting, bold numbers indicate proportions of data sets with a correctly 
selected number of clusters. 

For data sets composed of two clusters both having rather high weights, 
the probability of a selection of a correct model increases with the sam- 
ple size, practically reaching 100% with N = 200, and normally distributed 
random effects. The situation is slightly worse when the true distribution of 
random effects is MVT5. Nevertheless, the results are also rather satisfactory 
in this case. 

When there were three clusters present, with one of the clusters having 
a rather small weight of 0.06, the probability of a correct selection of a 
number of clusters also increases with the sample sizes (for both normally 
and MVT5 distributed random effects). However, it is much slower than in 
the case of two clusters of almost equal size. Nevertheless, we point out that 
already for the moderate sample size of 200 subjects, in 97% and 94% of 
cases, respectively, the number of clusters indicated by the value of PED 
differs by at most one from the correct value of three. 

6. Discussion. In this paper we have proposed a method for classification 
of subjects on the basis of longitudinal measurements of several outcomes 
of a different nature which, according to the best of our knowledge, has 
not been considered in the literature yet. The clustering procedure relies on 
a classical GLMM specified for each marker, whereas possible dependence 
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across the values of different markers is captured by specifying a joint distri- 
bution of all random effects. In contrast to a classical assumption, we assume 
a normal mixture in the random effects distribution which is the core clas- 
sification component of the proposed model. Although other choices of the 
basis distributions could be considered as well, we would like to stress the 
fact that with the Bayesian approach used here, the normality assumption 
only corresponds to a particular choice of just one component of the overall 
prior distribution which is updated by the data. For example, the posterior 
predictive distribution for random effects is given by 






P P rcd(b) = / < 2_jW k (p(b;fi k ,~D k ) \p(9\y)d0, 

which in general is no longer a Gaussian mixture. In a similar way, the pos- 
terior distribution enters the calculation of the posterior component prob- 
abilities (9) while taking into account the uncertainty in the specification 
of the distribution of random effects. Last but not the least, the simulation 
study also suggests that, at least in situations when the true random effects 
distribution is symmetric with heavier tails, assuming a priori a normal ran- 
dom effects distribution does not have any crude impact on classification 
error rates. 

Being within the Bayesian framework, it is also relatively easy to calculate 
not only point estimates of the individual component probabilities, but also 
corresponding credible intervals which can be subsequently used to evaluate 
uncertainty in the pertinence of a particular subject in a specific cluster. Such 
uncertainty is only rarely evaluated in similar situations. Finally, we adapted 
recently published methodology for model comparison based on the concept 
of a penalized expected deviance to explore the optimal number of clusters. 

Further, we point out that even though we classified in this paper only 
those subjects who were also used to draw the posterior inference on the 
model parameters, our procedure can also be used to classify a new subject 
with a value of observed longitudinal markers equal to y ncw and unknown 
allocation u ncw . Indeed, given the posterior sample Sm from p{ip,0\y), the 
component probabilities Pnew,k(' t P ,& ) and estimated posterior compo- 
nent probabilities vr n0 w,fc (k = 1, . . . , K) can be calculated using expressions 
(10) and (9), and then used for classification. Finally, it is worth mentioning 
that discriminant analysis, where a training data set with known cluster 
allocation is available, would also be possible with only slightly modified 
methodology where separate posterior samples would be drawn using the 
data from the various clusters and then used to calculate the component 
probabilities. 

For practical analysis, we extended the R package mixAK [Komarek (2009)] 
to cover the methodology proposed in this paper. The package is freely avail- 
able from the Comprehensive R Archive Network at http://cran.r-project.org/. 
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SUPPLEMENTARY MATERIAL 

Appendices (DOI: 10.1214/12-AOAS580SUPP; .pdf). The pdf file con- 
tains (A) more detailed description of the assumed prior distribution for 
model parameters, giving also some guidelines for the selection of the hy- 
perparameters to achieve a weakly informative prior distribution; (B) more 
details on the posterior distribution and the sampling MCMC algorithm; 

(C) additional information to the analysis of the Mayo Clinic PBC data; 

(D) more detailed results of the simulation study. 
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